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1 Introduction 



In this modern information age analysis of large data sets is an integral part of both 
scientific research and practical problem-solving. High-dimensional time series occur 
in many fields including, among others, finance, economics, environmental and medical 
studies. For example, to understand the dynamics of the returns of large number of 
assets is the key for portfolio allocation, pricing and risk management. Panel time series 
are common place in studying economic and business phenomena. Environmental time 
series are often of a high- dimension because of the large number of indices monitored over 
many different locations. On the other hand, the conventional time series models such as 
vector AR or ARMA are not practically viable without a proper regularization when the 
dimension is high, as the number of parameters involved is a multiple of the square of 
the dimension. Hence it is pertinent to reduce the dimension of the data before making 
further analysis. Different from the dimension-reduction for independent observations, 
the challenge here is to retain the dynamical structure of the time series. 

Using common factors is one of the most frequently used and effective ways to achieve 
dimension-reduction in a nalyzing multip 



e time series. Early a t tempts in this d irec- 
tion include, for exam ple, lAndersonl (Il963f ). iPriestley et al.l (1 19741 ). iBrillingerl (ll98lf ) and 



Pena and Box 



(I1987I ). To deal with the new challenge resulted from the fact that the 
number of time series p may be as large as, or even larger than, the length of time series n 
(such as most panel data), more recent effort (mainly in ec onometrics) focuses on the in 



ferenc e when p goes 
(1198.4 



to oo 



Chamberlain 



f|1983h 



ong with ri 



Bail (120031 ) 



. See, for example 



Forni et al 



( 12000 



Chamber 



2004 



ain and Rothschild 



20051 ) . Furthermore mo- 



tivated by analyzing some economic and financial phenomena, those econometric factor 
models aim to identify the common factors in the sense that each common factor affects 
the dynamics of most of the original p time series. Those common factors are separated 
from the so-called idiosyncratic 'noise' components; each idiosyncratic component may at 
most affect the dynamics of a few original time series. Note an idiosyncratic noise series is 
not necessarily white noise. The rigorous defini tion /identification of the common fa ctors 
and the idios y ncrat ic noise was established by IChamberlain and Rothschild! (119831 ) and 



Chamberlain! (119831 ) in an asymptotic manner when the number of time series goes to 



infinity, i.e. thos e econometr i c fact or models are only asymptotically identifiable when 

(120001 ). 



p — > oo. See also 



Forni et al. 



We adopt a different and more statistical approach in this paper from a purely 



2 



dimension-reduction poin t of v iew. Our mod e 



Pena and Poncelal ( 120061 ) and |Pan and Yaol (120081 ) . However we consider the inference 



is similar to those in 



Pefia and Boxl rtl987T ). 



when p is as large as, or even larger than, n. Furthermore, we allow the future fac- 
tors to depend on past (white) noise. This substantially enlarge the capacity of the 
model. Different from the aforementioned econometric factor models, we decompose the 
p-dimensional time series into two parts: the dynamic part driven by a low- dimensional 
factor and the static part which is a vector white noise. Such a conceptually simple de- 
composition brings in conveniences in both model identification and statistical inference. 
In fact the model is identifiable for any finite p. Furthermore the estimation for the 
factor loading matrix and the factor process itself is equivalent to an eigenanalysis for a 
p x p non-negative definite matrix, therefore is applicable when p is in the order of a few 
thousands. Our estimation procedur e is rooted at the s ame i dea as those on which the 
methods of IPena and Poncelal (120061 ) and |Pan and Yaol (120081) were based. H owever our 



method itself is substantially simpler. For example, IPena and Poncelal (120061 ) requires to 
compute the inverses of sample autocovariance matrices, which is computationally more 
costly when p is large, and is invalid when p > n. Furthermore in contrast to the eige- 
nanalysis for one matrix, it performs eigenanalys is for a matrix funct ion of the sample 



autocovariance for s everal different lags; see also 



Pefia and Box 



jl987h . 



The method of 



Pan and Yaol ( 120081 ) involves solving several nonlinear optimization problems, which is 



designed to handle non-stationary factors and is only feasible for moderately large p. 
Our approach identifies factors based on autocorr elation struc t ure, w hich is more r ele- 
vant than the least squares approach advocated by iBai and Ngl (120021 ) and iBail (120031 ) in 
the context of identifying time series factors. In fact our method outperforms the least 
squares method in a numerical experiment reported in section [6l 

The major theoretical contribution of this paper is to reveal an interesting and some- 
how intriguing feature in factor modelling: the estimator for the factor loading matrix 
and the resulting estimator for the precision matrix of the original p-dimensional time 
series converge to the true ones at the rates independent of p, provided that all the fac- 
tors are strong in the sense that the norm of each colunms in the factor loading matrix 
is of the order p 1 / 2 . Our simulation results indicate that indeed the estimation errors 
are indeed independent of p. This result exhibits clearly that the 'curse' is canceled out 
by the 'blessings' in dimensionality, as the high dimensionality is offset by combining 
together the information from high- dimensional data via common factors. However our 
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factor model cannot improve the estimation for the c ovariance matrix of the original time 

(120081 ) with independent 



Fan et al. 



series, which coincides with the result established by 
observations and known factors. 

Another interesting finding from our asymptotic theory is to use a two-step estimation 
procedure for a better performance when some factors are strong and some are not. To 
this end, we characterize the strength of factors explicitly by an index and show that the 
convergence rates of the estimat ors depend on those indices. The concept of weak and 



strong factors was introduced in 



Chudik et al. 



( 120091 ) in a different but related manner. 



Further development of our setting with n onstationary facto rs together with forecast 



Lam et al. 



(120101 ). 



ing issues are reported in a companion paper 

The rest of the paper is organized as follows. The model, its presentational issues and 
the estimation methods are presented in Section |2l Section [3] contains the asymptotic 
properties of the proposed methods when all the factors are of the same strength. The 
results for the cases when there exist factors of different levels of strength are given in 
section HJ Extensive simulation results are presented in section [51 with analysis of a set 
of implied volatility data in section |6j All technical proofs are relegated to section [JJ 



2 Models and estimation methodology 
2.1 Factor models 

Let yi, ■ • • , y n be n p x 1 successive observations from a vector time series process. The 
factor model assumes 

y ( = Ax i + e i , (2.1) 

where x f is a r x 1 unobserved factor time series which is assumed to be strictly stationary 
with finite first two moments, A is a p x r unknown constant factor loadings matrix, and 
r < p is the number of factors, and {e t } is a white noise with mean and covariance 
matrix S e . Furthermore, we assume that Cov(e t ,x s ) = for all s < t, and no linear 
combinations of the components of x 4 are white noise. (Otherwise such combinations 
should be absorbed in e t .) 



Model (12.11) has been studied by, for example, iPena and Box! (119871 ) and 



Pena and Poncela 



(120061 ) with a stronger condition that the factor process and the white noise are uncorre- 



cted across all the lags. We relax this condition to allow the future factor x.t+k correlated 
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with the past white noise e t (k > 1). This is an appealing feature in modelling some 
economic and finacial data. 

In this paper, we always assume that the number of factors r is know n and fixed- 
There is a la r ge body of literature on how to determ i ne r. See, for example. iBai and Ng 
( 20021 . 12OO7L Hallin and Liskal f 200^ 1 . Iran and Yaol (2008) and Bathia et al] ~f 2010 ). In 



section 0, we use an information criterion proposed by 
r in our simulation study. 



Bai and Ngi (120021 ) to determine 



2.2 Identifiability and factor strength 

Model (12. ip is unchanged if we replace the pair (A,x t ) on the RHS by (AH,H _1 xt) for 
any invertible H. However the linear space spanned by the colunms of A, denoted by 
Ai(A) and called the factor loading space, is uniquely defined by (12.11) . Note Ai(A) = 
.M(AH) for any invertible H. Once such an A is specified, the factor process x t is 
uniquely defined accordingly. We see the lack of uniqueness of A as an advantage, as we 
may choose a particular A which facilitates our estimation in a simple and convenient 
manner. Before we specify explicitly such an A in section 12.31 below, we introduce an 
index 5 for measuring the strength of factors, which is defined naturally in terms of A in 
the first instance. See conditions (A) and (B) below. 

Let £ x (fc) = Cov(x t+fc ,x t ), £ X)£ (fc) = Cov(x m , e f ), and 

n—k 

E x (fc) = (n- k)- 1 J2(*t+k ~ x)(x t - xf, 
t=i 

n—k 

Sx, 6 (A;) = in - ky l ^(xt+fc - x)(e t - ef, 

t=i 

with x = n~ l Y^t=i x *' ^ = n ~ l Ylt=i e *- ^«(^) an d E £)X (A;) are defined similarly. Denote 
by ||M|| the spectral norm of M, which is the positive square root of the maximum 
eigenvalue of MM T ; and by ||M|| min the positive square root of the minimum eigenvalue 
of MM T or M T M, whichever has a smaller matrix size. The notation a x b represents 
a = 0(b) and b = 0(a). Now we introduce the conditions on the strength of factors. 

(A) For k — 0, 1, • • ■ , ko, where ko > 1 is a small positive integer, S x (^) is full-ranked. 
The cross autocovariance matrix S Xi£ (/c) has elements of order 0(1). 

(B) A = (a x • • -a r ) such that ||ai|| 2 x p 1 " 5 , i — r, < S < 1. 
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(C) For each i = 1, • • • , r and 5 given in (B), min^. ^j || a £ — ^j a j\\ 2 x P 1 S - 

Note that model (12.11) is practically useful only if r << p. In our asymptotic theory 
we assume that r remains as a constant while both p and n go to infinity. Therefore 
S x (&) is an r x r fixed matrix of the full rank; see (A). When 5 = in assumption (B), 
the corresponding factors are called strong factors since it includes the case where each 
element of a, is 0(1), implying that the factors are shared (strongly) by the majority 
of the y cross-sectional var iables. On the other hand, they are called weak factors when 
5 > 0. 



Chudik et al. 



(120091 ) introduced a notion of strong and weak factors, determined 
by the finiteness of the mean absolute values of the component of a^. In this paper, we 
introduce index 5 which links explicitly the strength of factors x t and the convergence 
rates of our estimators. In fact the convergence is slower in the presence of weak factors. 
Assumptions (B) and (C) together ensure that all r factors in the model are of the equal 
strength. 

To facilitate our estimation, we normalize the factor loadings matrix suc h that all the 



columns of A are orthonormal, i.e. A T A = I r ; see, e.g. Pan and Yaol ( 2008 ). Then under 



assumptions (A) - (C), model (12.11) admits the follow representation. Its proof is given 
in the beginning of section [7| below. 

y t = Axj + e f , A T A = I r , with 
||S x (fc)|| x^x ||S x (fc)|U, ||E x , 6 (fc)|| =0(p 1 -'/ 2 ), 

(2.2) 

Cov(x<j, €f) = for all s < t, 
Cov(e s , e t ) = for all s ^ t. 

Unless specified otherwise, all yt, x 4 and e t in the rest of this section and also section [3] 
are defined in (12. 2p . 

2.3 Estimation 

For k > 1, model (I2.2p implies that 

E y (A;) = Cov(y m , y t ) = AS X (A;)A T + AS x , £ (fc). (2.3) 
For ko > 1 given in condition (A), define 

L = ^E y (fc)S y (fc) T = Af ^{S x (fc)A T +S x , e (A;)}{S x (A;)A r +S x , e (fc)} T )A T . (2.4) 

k=l ^ k=l ' 
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Obviously L is a p x p non-negative definite matrix. Now we are ready to specify the 
factor loading matrix A to be used in our estimation. First note that (I2.2p is unchanged if 
we replace (A, xt) by (AQ, Q T x t ) for any rxr orthogonal matrix Q. Apply the spectrum 
decomposition to the positive-definite matrix sandwiched by A and A T on the RHS of 
(E3D, i.e. 

f^{S x (fc)A r + S x , e (A:)}{S x (fc)A T + ^ e (k)} T = QDQ T , 

fc=i 

where Q is an r xr orthogonal matrix, and D is a diagonal matrix with the elements on the 
main diagonal in descending order. This leads to L = AQDQ T A T . As Q T A T AQ = I r , 
the columns of AQ are the eigenvectors of L corresponding to its r non-zero eigenvalues. 
We take AQ as the A to be used in our inference, i.e. 

the columns of the factor loading matrix A are the r orthonormal eigenvectors 
of the matrix L corresponding to its r non-zero eigenvalues. 

A natural estimator for the A specified above is defined as A = (a 1; • • • , a r ), where 
a~j are the eigenvector of L corresponding to its z'-th largest eigenvalues, £li , • • • , ciy. are 
orthonormal, and 

£,'0 -i n—k 

L = £E y (fc)E y (fc) r , S y (A;) = — 5>t +fe -y)(y t -yf, (2.5) 

k=l i=l 

with y = n" 1 YJl=iYt- 

Consequently, we estimate the factors and the residuals respectively by 

% = A T y t , e t = y t - A% = (I p - AA T )y, (2.6) 

3 Asymptotic theory 

In this section we present the rates of convergence for the estimator A for model (12 .2p . as 
well as the corresponding estimators for the covariance matrix and the precision matrix, 
derived from model (12. 2p . We need the following assumption for the original model (12.11) : 

(D) It holds for any < k < k that the elementwise rates of convergence for S x (fc) — 
E x (&), S Xi£ (/c) - S Xj£ (/c) and S e (fc) - S e (fc) are respectively P (n~ lx ), Op(n~ lxf ) 
and Op(n~ lt ), for some constants < l x ,l xe ,l e < 1/2. We also have, elementwise, 
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With the above assumption on the elementwise convergence for the sample cross- and 
auto-covariance matrices of x t and e t , we specify the convergence rate in the spectral 
norm for the estimated factor loading matrix A. It goes without saying explicitly that 
we may replace some a, by — a,- in order to match the direction of a.j. 

Theorem 1 Let assumption (D) hold, \\Yi^ e {k)\\ = o{p l ~ 6 ), and the r non-zero eigen- 
values of matrix L in \2.4\j are different. Then under model 112.2}) , it holds that 

|| A - A|| = P (h n ) = P {n- 1 * + p 5/2 rT 1 ^ + v s rT h ), 

provided h n = o(l). 

This theorem shows explicitly how the strength of the factors 6 affects the rate of con- 
vergence. The convergence is faster when the factors are stronger (i.e. § gets smaller). 
When 5 = 0, the rate is independent of p. This shows that the curse of dimensionality 
is offset by the information from the cross-sectional data when the factors are strong. 
Note that this result does not need explicit constraints on the structure of S £ other than 
implicit constraints from assumption (D). 

The assumption that all the non-zero eigenvalues of L are different is not essential, 
and is merely introduced to simplify the presentation in the sense that Theorem [T] now 
can deal with the convergence of the estimator for A directly. Otherwise a discrepancy 
measure for two linear spaces has to be introduced in order to make state ments on the 
conver gence rate of the estimator for the factor loading space Ai(A); see 



Pan and Yao 



fl2008h . 

To present the rates of convergence for the covariance matrix estimator S y of S y = 
Var(y 4 ) and its inverse, we introduce more conditions. 

(Ml) The error-variance matrix X! e is of the form 

S e = diag^l^, a\l T m2 , ■■■ ,a 2 k ll k ), 

where fc, mi, • ■ ■ , m^ > 1 are integers, l mj denotes the rrij x 1 vector of ones, and 
all the <t| are uniformly bounded away from and infinity as n — > oo. 

(M2) It holds that p 1 ~ 5 s~ 1 h 2 l — > and r 2 s~ 1 < p 1 ^ 6 ^, where s = minixj^mj and h n 
given in Theorem [TJ 
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Condition (Ml) assumes that the white noise components for p time series are uncorre- 
cted with each other at any fixed time. Furthermore, there are only maximum k different 
values among their variances. This facilitates a consistent pooled estimator for S e ; see 
also (M2). 

We estimate S y by 



S y = AS X A T + S e , with S x = A T (S y - S e )A and 
E^diag^l^ 2 !^,-- - ,d 2 k ll k ), o* = n- 1 8j 1 \\A j S\\ 



(3.7) 



where Aj = diag(0^, • • • , 0^, 1^., 0£. +1 , • • • , 0£J, E = (I p - AA T )(y x • • • y n ), and 
the norm ||M||^ denotes the Frobenius norm, defined by \\M\\ F = tr(M T M) 1 / 2 . 

In practice we do not know the value of k and the grouping. We may start with one 
single group, and estimate S e = a 2 I p . By looking at the sample covariance matrix of 
the resulting residuals, we may group together the variables with similar magnitude of 
variances. We then fit the model again with the constrained covariance structure specified 
in (Ml). 

The theorem below presents the convergence rates for the sample covariance estimator 
S y = S y (0) defined in (12. 5p and the factor model based estimator S y defined in (13.71) . 

Theorem 2 Under assumption (D), it holds that 

||S y - Sy|| = Opip^K) = P (p l - s n- 1 * +p l - s > 2 n- 1 ^ +pn~ h ). 

Furthermore, 

||Sy-Sy|| =0 P (P 1 - S h n ), 

provided that the condition of Theorem^ and (Ml) and (M2) also hold. 

Theorem [2] indicates that asymptotically there is little difference in using the sample 
covariance matrix or the factor model-based covariance matrix estimator even when all 
the factors are stron g, as both t he es timators have rates of convergence linear in p. This 
result is in line with 



Fan et al 



(120081 ) which shows that the sample covariance matrix as 
well as the factor model-based covariance matrix estimator are consistent in Frobenius 
norm at a rate linear in p, with the factors known in advance. We further illustrate this 
phenomenon numerically in section [5j 

However as for the estimation for the precision matrix S y , the estimator S y per- 
forms significantly better than the sample counterpart. 
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Theorem 3 Under the condition of Theorem^ (Ml) and (M2), it holds that 

HSy 1 - = P ((l + (P 1 -**- 1 ) 1 / 2 )^) 

= P (h n + (ps-^ip-^n- 1 * + rT l *< + p s/2 n - k )). 



Furthermore, 



|| - Sy 1 !! = P {p 1 - 5 h n ) = || Sy - Sy|| 

provided p x ~^h n = o(l) and p < n. 

Note that if p > n, S y is singular and the rate for the inverse sample covariance 
matrix becomes unbounded. If the factors are weak (i.e. 5 > 0), p will still be in the 
above rate for S y . On the other hand if the factors are strong (i.e. 5 = 0) and sxp, 
the rate is independent of p. Hence the factor model-based estimator for the precision 
matrix S" 1 is consistent in spectral norm irrespective of the dimension of the problem. 
Note that the condition s x p is fulfilled when the number of groups with different 
error variances is small. In this case, the above rate is better than the Frobenius norm 



convergence rate obtained in Theorem 3 of iFan et al.l (120081 ). This is not surprising since 



the spectral norm is always smaller than the Frobenius norm. On the other hand, the 
sample precision matrix has the convergence rate linear in p, which is the same as for the 
sa mple covariance m atrix. 



Fan et al 



(2008 



1) studied the rate of convergence for a factor model-based precision 
matrix estimator under the Frobenius norm and the transformed Frobenius norm || ■ ||s 
defined as 

||B|| s =p- 1 / 2 ||S- 1 / 2 BS- 1 / 2 || F , 

where || • ||f denotes the Frobenius norm. They show that the convergence rate under 
the Frobenius norm still depends on p, while the rate under the transformed Frobenius 
norm S = I] y is independent of p. Since S y is unknown in practice, the latter result has 
little practical impact. 



4 Factors with different levels of strength 
4.1 Models and two estimation procedures 

Theorems Q] and [3] show that the strength of factors plays an important role in the 
convergence rates of our estimators. To investigate the impact from the presence of the 
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different levels of factor strength, we consider the case that the factors are of two levels 
of strength. The cases with more than two levels may be treated with more complex 
technical details. 

In view of model (I2.2p . we assume that we have two group of factors, x 4 = (x^ x^) T 
and A = (Ai A 2 ), where x Jt is a r j x 1 vector and Aj is a p x rj constant matrix for 
j — 1,2. The model we consider is then 

y t = Aix u + A 2 x 2 t + e t , AjAj = I rj and Af A 2 = 0, with 
llEtfMU xp 1 ^ x ||S,,(A;)|| mm , ||S 12 (fc)|| = O^/ 2 ^/ 2 ) = ||E 21 (fc)||, 
\\T l3e {k)\\=0{p 1 -^ 12 ), (4.8) 
Cov(x :)S , e t ) = for all s <t, 
Cov(e s , e t ) = for all s ^ t. 

Unless specified otherwise, all y t , x^ and e t in the sequel of this section are defined 

in 



We may continue apply the estimation method outlined in section 12.31 to obtain the 
estimator A = (A!,A 2 ). However such a simple procedure may encounter problems 
when some factors are weak, or are much weaker than the others. Since the eigenvalues 
corresponding to those weak factors are typically small, it may be difficult in practice 
to distinguish them from in the presence of some large eigenvalues. Under those cir- 
cumstances, we should remove the strong (or stronger) factors first, and then repeat the 
estimation procedure again in order to identify the weak (or weaker) factors . This is the 



essential idea behind the two-step procedure proposed by iPena and Poncelal (120061 ) with 
some illustrative numerical examples. We provide below a theoretical justification for 
using the two-step estimation method for model (14. 8p in which the factors are of different 
levels of strength. 

We assume that r± and r 2 are known. Our two-step procedure is defined as follows: 
(i) By ignoring the term A 2 x 2t in model (14.81) . apply the estimation method in section |2~31 
to obtain A x . (ii) By removing factor jc u , 

y* = y t - A t Afy t , (4.9) 



estimate A 2 from model y^ = A 2 x 2t + e* t using the method of section 12.31 The estimator 
obtained is denoted as A 2 , and we denote A = (Ai, A 2 ). 
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4.2 Asymptotic theory 

In view of model (12. 2p and the results from Lemma [1] derived from assumption (D), we 
directly assume the following for model (14. 8ft : 

(D)' For < k < ko and i = 1,2, the elementwise convergence rates for Xjj(fe) — 
Sjj(fc), Si 2 (fc) - Si 2 (fc), E ie (A;) - Sie(fc) and E e (fc) - S e (/c) are, respectively, 
Op{jp l ~ Si n- li ), Opip 1 - 51 / 2 ' 6 ^ 2 ^ 112 ), P {p {1 - Si)/2 n- 1 ^) and P (n- h ), where < 
li, 1x2, he, h < 1/2. Furthermore, S ei (fc) = OpftjC 1 - 4 *^^), S 21 (A;) - E 2 i(A;) = 
Op(p 1_{l/2 " j2/2 n" il2 ) elementwisely. 

Theorem 4 Lei ||Ej e (A;)|| = ofjo 1 " 51 ) fori = 1,2, and condition (D)' hold. Under model 

||Ai - Ai|| = Op(cji), ||A 2 - A 2 || = P (w 2 ) = ||A - A|| 
provided u>i = o(l) and w 2 = o(l), and 

||A 2 - A 2 || = Opip^cj,) = ||A - A||, 

where 

u)! = n~ h +p 5l - S2 n~ h +p h ^n~ h2 +/ l/2 n" /l£ + p 5l - 52/2 n _/2£ +p &l n~ l % 

( p 2 * 2 " 25 ^, i/||£ 21 (fc)||^ = 0(p 1 -*'); 
w 2 = \ pF-^Ux, if ||S 21 (fc)|| min x p 1 -^ 2 , with 5, + 5 2 < c < 25 2 ; 
( p 5 *-^, i/||E 21 (A;)|| x^V^fc/^ x ||E 21 (fc)|| min . 

Theorem H] indicates that while the estimators for the loading Ai on the stronger 
factor Xit using the two methods are exactly the same, the estimation for the loading 
A 2 on the weaker factor may benefit from the two-step procedure, as the convergence 
rate for A 2 is faster than that for A 2 when ||S 2 i(A;)|| min = 0{p 1 ~ &2 ) for example. The 
practical implication of this result is that we should search in the residuals, after an 
initial fitting, for possible weak factors, especially if the number of non-zero eigenvalues 
is determined by some 'eyeball' test which remains as one of the most frequently used 
methods in practice. 

Theorem 5 Under the condition of Theorem^ ||x 2f || = op(||xit||) provided p 52 ~ 5l uj\ = 
o(l), where % = (xf 4 , xJ t ) T is defined in ( Iff. 6]) . 
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Theorem indicates that under the normalization condition AA T = I n the different 
levels of factor strength will also be reflected on the magnitude of the norms of the 
estimated factors. In the case that the norms of the estimated factors, derived from 



the method in section 12. 3[ differ substantially, the two-step procedure may be applied to 
improve the estimation. 

The next two theorems are on the convergence rates for the estimation for the covari- 
ance and the precision matrices of y<. To this end, we recast condition (M2) for model 
(BSD first. 



(M2)' For s = mini<j<fcmj, where rrii are given in condition (Ml), it holds that 
s -i < p 1 -5i||A*- A|| 2 and (1 + (p 1_5l s~ 1 ) 1/2 ) || A* - A|| = o P (l), 

where A* is either A or A. 

Theorem 6 Under model lji4-8\ ) and condition (D)', 

||S y - S y || = Opip^n" 11 +p 1 ~ S2 n- h +p 1 - 5l/2 n- h * + p l ~ S2/2 n - he + pn~ h ) 
= O p (p 1 - 5 'lo 1 ), 

\\% - S y || = Opip 1 ^ ||A - A||), ||S y - Sy|| = P {p l ^\\A - A||), 

where S y is the sample covariance matrix ofy t , S y is the factor-model based estimator 
defined by Jff. 7| ), and S y is defined in the same manner as S y with A replaced by A. 



Similar to Theorem [2j the above theorem indicates that the factor model-based ap- 
proach cannot improve the estimation for the covariance matrix of yt over the simple 
sample covariance matrix. In fact, it may do worse when the levels of factor strength 
differ substantially, rendering a worse convergence rate for ||A — A||. 



Theorem 7 Let conditions (D)', (Ml) and (M2)' holds. Under model U^W, 



HS^-E^IHOpdlSy-Syll) 

provided ||S y — S y || = op(l), and 

WK 1 - ^11 = Op((1 + (P^s-^WA - A||), 
HSy 1 - S y l = P ((1 + (p^s-^WA - A||), 
where S y , S y and S y are the same as in Theorem® 
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Similar to Theorem [31 Theorem [7] shows that the factor model-based methods may 
improve the estimation for the inverse covariance matrix of y t , especially with the two- 
step estimation method as the factors in model (14. 8p are of different levels. 

5 Simulations 

In this section, we illustrate our estimation methods and their properties via two simu- 
lated examples. 

Example 1. We start by a simple example to illustrate the properties exhibited in 
Theorem [TJ [2] and [3j Assume a one factor model 

y t = Ax t + 6*, e tj ~ i.i.d. N(0, 2 2 ), 

where the factor loading A is a p x 1 vector with 2cos(27ri/p) as its i-th element, and 
the factor time series is defined as x t = 0.9x t -i + r) t , where rj t are independent N(0, 2 2 ) 
random variables. Hence we have a strong factor for this model with 5 = 0. We set 
n = 200,500 and p = 20, 180,400, 1000. For each (n,p) combination, we generate from 
the model 50 samples and calculate the estimation errors as in Theorems [H [2] and [31 The 
results with n = 200 are listed in Table [T] below. The results with n = 500 is similar and 
thus not displayed. 



71 = 


200 


l|A-A|| 




l|s v -s- 1 !! 


l|S y -S y || 


l|S y -S y || 


P = 


= 20 


■022(.oo5) 


■24(.o3) 


■009 ( .oo2) 


218(165) 


218(165) 


P = 


180 


■023(.oo4) 


79.8( 2 g.s) 


•007 ( .ooi) 


1962( 1500 ) 


1963(1500) 


P = 


400 


■022(. 0O 4) 




■007 ( .ooi) 


4102 (3472) 


4103(3471) 


P = 


1000 


■023(.oo4) 




■007 ( .ooi) 


107 97( 68 20) 


108 00(6818) 



Table 1: Mean of estimation errors for the one factor example. The numbers in brackets 
are the corresponding standard deviations. 

It is clear from Table [TJ that the estimation errors in L 2 norm for A and S y are 

independent of p, as indicated by Theorem [T] and Theorem [3] with 5 = 0, as we have a 

— i 

strong factor in this example. The inverse of the sample covariance matrix S y is not 
defined for p > n, and is a bad estimator even when p < n as seen in above table. The 
last two columns show that the errors of estimators S y and S y increase as p increases. 
This is in agreement with Theorem [2J 



14 



Example 2. Now we consider a model with factors of different levels of strength. We 
generate data from model (14. 8 j) with r = 3 factors: 

xi tt = -0.8x M _i + 0.9e 1)t _i + e 1>t , 
x 2 ,t = -0.7x 2 , t -i + 0.85e 2 ,t_i + e 2>t , 
x 3tt = 0.8x2,* - 0.5x 3it _i + e 3)t , 

where e^t are independent iV(0, 1) random variables. For each column of A, we generate 
the first p/2 elements randomly from the U(—2,2) distribution; the rest are set to zero. 
We then adjust the strength of the factors by normalizing the columns, setting a^/p 1 ^/ 2 
as the z-th column of A (we set 5 2 = 5s). We let e t be p x 1 independent random 
vectors with mean and variance diag(0.5, 0.8, 0.5, 0.8, • • •), and the distributions of all 
the components of e t are either normal or t 5 (properly normalized such that the variance 
is either 0.5 or 0.8). 

We set n = 100, 200, 500, 1000 and p = 100, 200, 500. The first factor has strength 
index 5\ and the last two factors have strength index 5 2 . Both 5\ and 5 2 take values 0, 0.5 
or 1. For each combination of (n,p, 5\, 5 2 ), we replicate the simulation 100 times, and 
calculate the mean and the standard deviations of the error measures. 

Figured] displays the mean of ||S — S~ 1 \\ in the 100 replications. For precision matrix 
estimation, figure [1] shows clearly that when the number of factors are not underestimated, 
the two step procedure outperforms the simple one when strength of factors are different, 
and performs at least as good when the factors are of the same strength. The performance 
is better when the number of factors is in fact more than the optimal because we have 
used ko = 3 instead of just 1 or 2 when the serial correlations for the factors are in 
fact quite weak. Hence we accumulate pure noises, which sometimes introduces non- 
genuine factors that are stronger than the genuine ones, and requires the inclusion of 
more than necessary factors to reduce the errors. Not shown here, we have repeated 
the simulations with ko = 1, and the performance is much better and is optimal when 
the number of factors used is 3. The two-step procedure still outperforms the simple 
one. The simulations with normal errors are not shown here since the results are similar. 

The mean of || A — A|| and || A — A|| exhibit similar patterns as shown in figure Q] for 
- _i * 

|| X y — S y || and ||S — S y || respectively, and the results are not shown. 

For the covariance matrix estimation, our results (not shown) show that, as in Theo- 
rem both the sample covariance matrix and factor model based one are poor estimators 
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100 200 500 

Sample size n 



1000 



2.?' 



0.5 



— i — simple, p=100 
— i — two-step, p=1 00 

simple, p=200 
o ■ two-step, p=200 
-v- simple, p=500 
■V- two-step, p=500 




100 



200 

Sample size n 



500 



1000 




Figure 1: Mean of ||S — S" 1 ! and ||S — S~ 1 1| , noises distributed as t 5 . Top row: 
5\ = 62 = S3 = 0. Bottom row: 5± = 0, 62 = S3 = 0.5. Left column: Three factors used in 
estimation. Right column: Four factors used in estimation. The legends for the bottom 
row are the same as the top row. 

when p is large. In fact both the simple and two-step procedures yield worse estimation 
errors than the sample covariance matrix, although performance gap closes down as n 
gets larger. 
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6 Data Analysis : Implied Volatility Surfaces 



We illustrate the methodology developed through modeling the dynamic behavior of IBM, 
Microsoft and Dell implied volatility surfaces. The data was obtained from OptionMetrics 
via the WRDS database. The dates in question are 03/01/2006 - 29/12/2006 (250 days 
in total). For each day t we observe the implied volatility W t (ui,Vj) computed from call 
options as a function of time to maturity of 30, 60, 91, 122, 152, 182, 273, 365, 547 and 730 
calender days which we denote by Ui, i — 1, . . . ,p u (p u = 10) and deltas of 0.2, 0.25, 0.3, 
0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, and 0.8 which we denote by v j} j = 1, . . . ,p v 
(p v = 13). We collect these implied volatilities in the matrix W t = (W t (ui, u 4 )) £ W uXPv . 
Figure [2] displays the mean volatility surface of IBM, Microsoft and Dell over the period 
in question. It is clear from this graphic that the implied volatilities surfaces are not 
flat. Indeed any cross-section in the maturity or delta axis display the well documented 
volatility smile. 

dell ibm microsoft 




Figure 2: Mean implied volatility surfaces. 



It is a well docum e nted st ylized fact that imp lied v olatilities ar e non- stationary (see 



Cont and da Fonseca 


(1988 


), 


Fender et al. 


(2007 


) and 


Park et al. 


(2009) 



ers). Indeed, when applying the Dickey- Fuller test to each of the univariate time series 
Wt(ui,Vi), none of the p u x p v = 130 nulls of unit roots could be rejected at the 10% 
level. Of course we should treat the results of these tests with some caution since we 
are performing a large number of hypothesis tests, but even still the evidence in favor of 
unit roots is overwhelming. Therefore, instead of working with W t directly, we choose 
to work with AW t = W t — Wt_i. Our observations are then y t = vec{AW t }, where 



for any matrix M = (mi, 



l i»«xp 0j vec {M} 
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that Yt is now defined over 04/01/2006 — 29/12/2006 since we lose an observation due 
to differencing. Hence altogether there are 249 time points, and the dimension of y t is 
p = p v x p u = 130. 

We perform the factor model estimation on a rolling window of length 100 days. A 
window is defined from the i-th day to the (i + 99)-th day for % — 1, • • ■ , 150. The length 
of the window is chosen so that the stationary assumption of the data is approximately 
satisfied. For ea ch window, we com pare our methodology with the least squares based 
methodology by iBai and Ngl (120021 ) by estimating the factor loadings matrix and the 
factors series for the two methods. For the i-th window, we use an AR model to forecast 
the (i + 100)-th value of the estimated factor series x-^ 100 , so as to obtain a one-step ahead 
forecast y^+ioo = Ax||' 1O0 for yj+ioo- We then calculate the RMSE for the (i + 100)-th 
day defined by 

RMSE = p" 1/2 ||yff 100 - y i+ ioo||- 



More in depth theoretical as well as data analysis for forecasting is given in 
(120 lok 



Lam et al. 



6.1 Estimation results 



In forming the matrix L for each window, we take ko = 5 in (12.51) , taking advantage 
that the autocorrelations are not weak even at higher lags, though similar results (not 
reported here) are obtained for smaller fc - 

Figure [3] displays the average of each ordered eigenvalue over the 150 windows. The 
left hand side shows the average of the largest to the average of the tenth largest eigenvalue 
of L for Dell, IBM and Microsoft for our method, whereas the right hand side shows the 
second to eleventh largest. We obtain similar results for the lBai and Ngi (120021 ) procedure 
and thus the corresponding graph is not shown. 

From this graphic it is apparent that there is one eigenvalue that is much larger than 
the others for all three companies for each window. We have done automatic se l ection 
for the number of factors for each window using the IC P \ criterion in IBai and Ngi ( 120021 ) 
and a one factor model is consistently obtained for each window and for each company. 
Hence both methods chose a one factor model over the 150 windows. 

Figure S] displays the cumulative RMSE over the 150 windows for each method. We 



choose a benchmark procedure (green line in each plot), where we 



value as the one-step ahead forecast. Except for Dell where IBai and Ngi ( 120021 ) procedure 



ust tr eat today's 
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Figure 3: Averages of ordered eigenvalues of L over the 150 windows. Left: Ten largest. 
Right: Second to eleventh largest. 
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Figure 4: The cumulative RMSE over the 150 windows. Red dotted: \Bai and Ngi ({2002) 
procedure. Green: Taking forecast y£_\ to be y t . Black: Our method. 



is doing marginally better, our methodo l ogy c onsistently outperforms the benchmark 
procedure and is better than iBai and Ngl (120021 ) for IBM and Microsoft. 
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7 Proofs 

First of all, we show how model ( 12. 2 p can be derived from ( 12. ip . 

Applying the standard QR decomposition, we may write A = QR, where Q is a p x r 
matrix such that Q T Q = I r , R is an r x r upper triangular matrix. Therefore model 
( 12. ip can be expressed as 

Yt = Q*t + e t , 

where x.' t = Rx t . With assumptions (A) to (C), the diagonal entries of R are all asymp- 

1-6 „. 

totic to p 2 . Since r is a constant, using 

IIRII = max llRull, ||R|| m m = min llRull, 

||u||=l l|u||=l 

and the fact that R is an r x r upper triangular matrix with all diagonal elements having 
the largest order p 1 ^ , we have 



||R|| xpV x ||R|| mm . 

Thus, for k = 1, • • • , k , S x /(fc) = Cov(x' t _ k , xj) = RS X (A;)R T , with 

p 1 ' 5 x ||R||^ in - ||S x (A;)|| min < ||S x ,(A;)|| min < ||£ x >(fc)|| < ||R|| 2 • ||S x (fc) || x p 1 " 5 , 

so that ||S X (A;)|| xp w x ||S x (A;)|| m i n . We used ||AB|| min > ||A|| min ■ ||B|| min , which can 

be proved by noting 

„ u r B T A r ABu (Bu) T A T A(Bu) llBull 2 

AB min = mm — — > mm — — - 

u^o ||u|| 2 u^o ||Bu|| 2 ||u|| 2 

w T A T Aw llBull 2 
> mm — n — ttt mm = A min • B min . (7.1) 

w^O || W || ^ u^O ||ll|| 2 

Finally, using assumption (A) that H, xe (k) = 0(1) elementwise, and that it has rp x p 
elements, we have 

||£ X ',e(A;)|| = ||RS x , e (fc)|| < ||R|| • ||Sx, 6 (A;)||jr = 0{p^) ■ 0(p^ 2 ) = 0{p l ~ 5 ' 2 ). 

Before proving the theorems in section [31 we need to have three lemmas. 

Lemma 1 Under the factor model 112. fy) which is a reformulation of \2. and under 
condition (D) in section \2~2\ we have for < k < fco, 

||S X (A;) - E x (fc)|| = Op^'V^), ||E e (fc) - S 6 (A;)|| = P (pn' h ), 
||E x , e (A;) - E x , e (A;)|| = P {p l ~ s ' 2 n- 1 -) = ||S e , x (A;) - E e , x (A;)||, 

for some constants < l x AxeAe — 1/2- Moreover, ||x f || 2 = Op(p 1 ^ 5 ) for all real t. 
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Proof. Using the notations in section I2T2"} let x t be the factors in model ( 12. ip . and 
x.' t be the factors in model ( 12. 2p . with the relation that x£ = Rx t , where R is an upper 
triangular matrix with ||R|| x p~^~ x ||R|| m in (see the start of this section for more details 
on R). Then we immediately have ||xj|| 2 < ||R|| 2 • ||x t || 2 = Op(p 1 ' s r) = Op(p 1 ^ s ). 

Also, the covariance matrix and the sample covariance matrix for {xj} are respectively 

E^fc) = R£ X (A;)R T , S X /(A;) = R£ X (»R T , 

where E x (fc) and S x (/c) are respectively the covariance matrix and the sample covariance 
matrix for the factors {x t }. Hence 

HErfO) - E*(k)\\ < ||R|| 2 ■ ||E x (fc) - E x (fc)|| 
= 0{p 1 ~ 5 ) -OpirT 1 ' t) 

which is the rate specified in the lemma. We used the fact that the matrix S X (A;) — S x (/c) 
has r 2 elements, with elementwise rate of convergence being 0(n~ lx ) as in assumption 



Golub and Van Loan! (119961). which is stated ex- 



Johnstone and Arthur 



(D). Other rates can be derived similar 
The following is Theorem 8.1.10 in 
plicitly since most of our main theorems are based on this. See 
J2OO9]) also. 

Lemma 2 Suppose A and A + E are n x n symmetric matrices and that 

Q = [Qi Q2] (Qi m n x r, Q 2 is n x (n — r)) 

is an orthogonal matrix such that spanf^Qx) is an invariant subspace for A (i.e., spanf^Qx) C 
span(A) ). Partition the matrices Q T AQ and Q r EQ as follows: 

7/'sep(Dx, D2) := minAeA(Di), ^ga(d 2 ) l-^ - A*| > 0, where X(M) denotes the set of eigenval- 
ues of the matrix M , and 

sep(D 1 ,D 2 ) 
l|E||< , 

then there exists a matrix P G R("~ r ) xr with 

l|P|1 ^ sep(D 1 ,D 2 ) l|E2111 

such that the columns of Qi = (Qi + Q2PXI + P T P^j~ 1 / 2 define an orthonormal basis for 
a subspace that is invariant for A + E. 
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Proof of Theorem [JJ. Under model (12.21) . the assumption that ||S x e (/c)|| = o(p 1 ^ 5 ), 
and the definition of L and D^, in section 12.31 such that LA = AD, D has non-zero 
eigenvalues of order p 2 ~ 2S , contributed by the term S x (/c)S x (/c) T . If B is an orthogonal 
complement of A, then LB = 0, and 

A T \ T , . _ /DO 



B T )^ AB ^={0 oj' ^ 

with sep(D, 0) = A min (D) x p 2 ~ 2S (see Lemma [2] for the definition of the function sep). 
Define El = L — L, where L is defined in (12. 5p . Then it is easy to see that 

||EJl|| < £{\\%r(k) - S y (A;)|| 2 + 2||S sr (Ar)|| • \\fty{k) - S y (A;)||}. (7.3) 

fe=i 

Suppose we can show further that 

||E L || = P (p 2 - 25 n- 1 * + p 2 ^ 2 n- 1 ^ + p 2 ~ 5 n- h ) = Op{p 2 ~ 25 h n ), (7.4) 

then since h n = o(l), we have from (17.21) that 

||E L || = P (p 2 - 25 h n ) < sep(D,0)/5 

for sufficiently large n. Hence we can apply Lemma [2] to conclude that there exists a 
matrix P G R(f- r ) xr such that 

||P|| < t n J |(E L ) 21 |l < 4 ||E L || = P (h n ), 
sep(D,0) sep(D,0) 

and A = (A + BP) (I + p T p) -1 / 2 is an estimator for A. Then we have 

||A- A|| = ||(A(I- (I + P T P) 1/2 ) + BP)(I + P r P)- 1/2 || 

< ||I- (I + P T P) 1/2 || + ||P|| 

< 2||P|| = P (h n ). 

Hence it remains to show (I7.4p . To this end, consider for k > 1, 

||S y (A;)|| = ||AS x (A;)A T + AS Xie (A;)|| < ||E x (fc)|| + ||E x , e (fc)|| = Ofr 1 " 5 ) (7.5) 

by assumptions in model (12.21) and ||S Xj£ (A;)|| = o(||S x (A;)||). Finally, noting ||A|| = 1, 

||S y (A;) - £ y (fc)|| <||S X (A;) - E x (fc)|| + 2||S X , 6 (A;) - £*, 6 (fc)|| 

+ ||£ e (fc)-£ e (A;)|| (7.6) 
=0 P {p l ~ s n- lx +p 1 - 5/2 n- 1 ^ +pn~ u ) 
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by Lemma [TJ With (I7.5p and ( 17. 6p . we can conclude from ( 17. 3p that 

||E L || = P (||S y (A;)||-||S y (A;)-E y (A;)||), 
which is exactly the order specified in (17. 4p . □ 

Proof of Theorem [2J For the sample covariance matrix S y , note that (17.61) is 
applicable to the case when k = 0, so that 

||S y - S y || = OpG^-V* + p^ s/2 n- lxf - +pn~ h ) = P (p 1 " <5 /i„). 



(7.7) 



For S y , we have 

|| S y - Sy|| < ||A£ X A T - A£ X A T || + ||S e - S 6 || 

= P (||A - A|| ■ ||£ x || + ||S X - S x || + ||S e - S e ||). 

We first consider ||S e — S e || = max^xp |<t 2 — cr ? | < h + h + h, where 

h = n- 1 s- 1 \\A j (I p -AA T )AX\\ 2 F , 

I 2 = In-^-lA,-^ - AA r )E||^ - a% 

h = 2™~\<r 1 ||A j (I p - AA T )AX|| F ■ HA^Ip - AA T )E|| F , 

with X = (xi ■ • • x n ), E = (ei • • • e n ). We have 

h < n-^-iA^Ip - AA T )(A - A)X||| 

< rrt^HAJ 2 • ||I P - AA r f • ||A - A|| 2 ■ ||X|| 2 

= Opip^s-'hl), (7.8) 

where we used Theorem [T] for the rate ||A — A|| 2 = Op(/i 2 ), and Lemma[T]to get ||X|| F = 
Op(p 1 ~ 5 n). Also we used ||Aj|| = 1 and ||I P — AA T || < 2. For I 2 , consider 

h < In-^^IIA.-EHl - a)\ + n-^-lA.AA^Hl 

= P (n^) + Op^-^-^IIA.AA^H 2 , + \\AjA(A - A) T E|||,)) 

= P (n^) + P (n- 1 S - 1 (\\A T B\\ 2 F + || (A - A) r E|||)) 

= Op{rT 1 ') + Op(n~ 1 s _1 (||A r E|| F )), (7.9) 

where we used assumption (D) in arriving at Irz" 1 ^" 1 1| AyE||p — <x 2 | = op(n~' e ), and that 
|| (A- A) T E||p < ||A T E||p for sufficiently large n since ||A — A|| = op(l). Consider afe^ 
which is the (i, j)-th element of A T E. We have 

E(ajej) = 0, Var(af e,-) = af E £ a; < max a] = 0(1) 

i<i<p 
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since the aj's are uniformly bounded away from infinity by assumption (Ml). Hence each 
element in A T E is Op(l), which implies that 



n s 



U-1|I A T E ||2 = q ( n -l s -l . rn) = Qpis- 1 ] 



Hence from (17. 9p we have 

I 2 = P {n^ + S' 1 ). (7.10) 

Assumption (M2) ensures that both I\ and I 2 are op(l) from (17.81) and (17.101) respectively. 
From these we can see that h = P (ll /2 ) = P ({j) l - 5 s- l ) l l 2 h n ), which shows that 

||E e -E e || =0 P ((p 1 ~ s s~ 1 f 2 h n ). (7.11) 

Next we consider ||E X — E x || < K\ + K 2 + K3 + K4, where 

K x = ||A r AS x A T A - E x ||, K 2 = || A r A£ x>e A||, 

K 3 = ||A T E e , x A T A||, K 4 = ||A T (E e - £ e )A||, 

where E x = rC x 2~^"=i( x < ~~ x )( x < — X ) T - Now 

K x < ||A T A-L r ||-||S x || + ||S x -S x || 

= P (\\A T A - I r \\ ■ (||£ x - S x || + ||S X ||) + ||S X - S x ||) 
= Op^K), 

where we used ||A T A - I r || = ||(A - A) r A|| = P (h n ) by Theorem HJ ||S X || = 0{p l ~ s ) 
from assumption in model (12. 2p . and ||S X — S x || = Op{p l ~ 5 n~ lx ) from Lemma [TJ Next, 
using Lemma [T]and the fact that £ xe = 0, we have 

K 2 = O p (K 3 ) = P (||£ x , e - E x>6 ||) = Opip^n- 1 -). 

Finally, using Lemma [T] again and (17. lip , 

K, = O p {\\% - E 6 || + ||E e - E e ||) = Op^s- 1 ) 1 ^ +pn- 1 *). 

Looking at the rates for K\ to K4, and noting assumption (M2) and the definition of h n , 
we can easily see that 

||E X -E X || =0 P (p 1 ~ s K). (7.12) 

From (17. 7p . combining (17.111) and (17.121) and noting assumption (M2), the rate for E y in 
the spectral norm is established, and the proof of the theorem completes. □ 
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Proof of Theorem [31 We first show the rate for S y . We use the standard inequality 



!|M--M 2 -1< II^IMIM.-M, 



1- \\(M 1 -M 2 )-M 2 



in ' 



(7.13) 



with Mi = S y and M 2 
that 



1=,-! 



S y . Under assumption (Ml) we have ||S e 



US; 1 - S^A^ 1 + A T S e - 1 A)- 1 A T S e - 1 || 

odls^n + ns^iMKs^ + A^A)- 1 ! 

0(1), 



IE: 1 ! 



so 



where we also used 



A^An < IKA^s^A)- 1 ) 



= A max {(A r S e - 1 A)- 1 } = X~L(A T ^A) = 0(1), (7.14) 

since the eigenvalues of Sj 1 are of constant order by assumption (Ml), and ||A|| m i n = 1 
with Ax 7^ for any x since A is of full rank with p > r, so that 

x T A T S7 1 Ax - T ^-^ 



A min (A T S^ 1 A) = min - "„ , e > min J 



y r sr x y • l|Ax|| 

— - — - — ■ mm 1 - — — 

y^o ||y|| x^o ||x|| 



= A^S: 1 ) ■ ||A|| min x 1. (7.15) 
Then by (17.131) together with Theorem [2] that ||S y — S y || = Op(p 1 ~ s h n ) = op(l), we have 

which is what we need to show. 



l-op(l) 



Now we show the rate for S y . Using the Sherman-Morrison- Woodbury formula, we 
< Y?j=i K ji where 



have || S y - S" 1 



K 5 



(S; 1 - S: 1 )A(S X 1 + A T S J 1 A) -1 A T S J 1 1| , 

s: 1 A(s x 1 + \ T fCZ)-iZ T {fC - s; 1 )!!, 
S7 X (A - A)(S X X + X^Xy^Y^i 
S7 1 A(S; 1 + A T YT e l A)-\A - AfV-% 



(7.16) 



S7 X A{(S X + A T S; A)- 1 - (S x x + A r S7 1 A)- 1 }A T S" ] 
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First, we have HS^H = 0(1) as before by assumption (Ml). Next, 

K, = max \af - af\ = "Tf^fl'f lu/2 , „ = P ((p^ s s" 1 )^), 

i<j<fc' 3 3 mmi< J < fe (7|((7? + Op((p 1 - 5 s- 1 ) 1 /2/ in )) 

where we used (17.111) and assumptions (Ml) and (M2). From these, we have 

||S; 1 ||=0 P (1). (7.17) 

Also, like (EH, 

IKS. + A^'a)- 1 !! =O p (1), (7.18) 

noting (17.111) and assumption (Ml). With these rates and noting that ||A|| = ||A|| = 1 
and || A — A|| = Op{h n ) from Theorem [TJ we have from (I7.16P that 



|s y x - s^H = Opdp^s-^K + h n ) 

+ OpUKK 1 + A^A)- 1 - (S; 1 + A^S^A)- 1 



where the last term is contributed from Kq. Using (I7.14p and (17.181) . and the inequality 
||Mf 1 - M 2 _1 || = OpiWM^W ■ \\Mj - M 2 \\ ■ ||M 2 -1 ||), the rate for this term can be shown 
to be Op(Li + L 2 ), where 

L y = US, 1 - S^H, L 2 = \\A T ?CA - A^AI). 

Consider ||S X 1 || < US, 1 )! = A m [ n (S x ) = 0(p-^) by assumption in model (Q. With 
this and (I7.12p . substituting M x = S x and M 2 = S x into (17.131) . we have 

_ 0(p-^).0 P (p^h n ) _ P lp-MK) _ Q ( - (1 - S)h , (7 „ m 

Ll " l-0 P (p^h n .p-(^)) ~ l-op(l) " ° P{P K) - (7 ' 20) 

For L 2l using ||A|| = ||A|| = 1, ||A — A|| = Op(h n ) from Theorem [TJ the rate for K\ 
shown before and (I7.17p . we have 

L 2 = P (\\A - A|| ■ IIS^II + US; 1 - S; 1 !!) = P (h n + (p l - s s- l ) l / 2 h n ). (7.21) 

Hence, from (IT. 19 j) . together with (I7.20p and (I7.2ip . we have 

HsT 1 - s; 1 !) = o P ((i + Grt- 1 ) 1 / 2 )/^), 

which completes the proof of the theorem. □ 
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Proof of Theorem HI The idea of the proof is similar to that for Theorem [T] for the 
simple procedure. We want to find the order of the eigenvalues of the matrix L first. 
From model (I4.8j) . we have for i — 1, 2, 

||E 12 (fc)|| 2 = 0{p 2 ~ 5 ^) = \\T, 2l {k)\\\ (7-22) 
\\V ie {k)\\ 2 = 0{p 2 ^). 

We want to find the lower bounds of the order of the ri-th largest eigenvalue, as well as 
the smallest non-zero eigenvalue of L. We first note that 

£ y (fc) = AxSu^Af + A 1 S 12 (A;)A^ + A 2 £ 21 (fc)Af 

+ A 2 S 22 (A;)A 2 r + A 1 H le {k) + A 2 E 2e (A;), (7.23) 

and hence 

L = AiWiAf + A 2 W 2 A 2 + cross terms, (7.24) 

where Wx (with size ri x ri) and W 2 (with size r 2 x r 2 ) are positive semi-definite matrices 
defined by 

fe 

Wl = {Sii(A:)Sii(A;) T + £ 12 (fc)£ 12 (£;) T + S le (A;)S le (A;) T j, 

k=l 

ko 

W 2 = J2 {s 21 (A;)S 21 (A;) r + E 2e (A;)E 2e (A;) T + E 22 (A;)£ 22 (A;) T 
k=i 

From Wi, by (17.221) and that ||S le (/c)|| = o(p 1 ~ Sl ), we have the order of the r 1 eigenvalues 
for Wi is all p 2 ~ 2Sl . Then the r x -th largest eigenvalue of L is of order p 2 ~ 2Sl since the 
term En(fe)Sn(fe) T has the largest order at p 2 ~ 2Sl . We write 

p 2 " 25l =0(A ri (L)), (7.25) 

where Aj(M) represents the i-th largest eigenvalue of the square matrix M. 

For the smallest non-zero eigenvalue of Wo, since ||S 2e (A;)|| = oip 1 ' 52 ), it is con- 
tributed either from the term E 22 (/c)E 22 (/c) T or S 2 i(A;)S 2 i(/c) T in W 2 , and has order 
p 2 - 25 * if ||S 2 i(fc)|| min = 0{p 2 ~ 2 ^), and p 2 ~ c in general if ||S 21 (fc)|| min x p 2 ~ c , with 
#i + #2 < c < 2<5 2 . Hence 

p 2 ^ = 0(A ri+r2 (L)), if UEaiWIU = o(p 2 "^), 
p 2 " c = 0(A ri+r2 (L)), if ||S 2 i(A;)|| min xp 2 - c , 8 1 + 5 2 < c < 26 2 . 
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Now we can write L = (A B)D(A B) T , where B is the orthogonal complement of A, 
and with D x containing the r\ largest eigenvalues of L and D 2 the next r 2 largest, 



D = D 2 , (7.27) 




withp 2 " 251 = 0(A min (D 1 )) by ([125]), andp 2 " c = 0(A min (D 2 )) by (HMD, 5 t +5 2 <c< 25 2 . 
Similar to the proof of Theorem [Tj we define El = L — L. Then (17.3)) holds, and 

||E y (*)ll = 0(p 1 -' 1 ), (7.28) 
using (EL2SD and ||En(A;)|| = 0(p 1 ~ h ) from ([7321 . Also, 

||£ y (fc) - E y (fc)|| = Op(||Eii(A;) - Eu(fc)|| + ||E 12 (A;) - E 12 (A;)|| 
+ ||E 22 (A;) - E 22 (A;)|| + ||E le (fc) - S l6 (A;)|| 
+ ||E 2e (A;)-E 2e (A ; )|| + ||E e (fc)-E e (fc)||) 
= Opip 1 - 51 ^ 11 +p l - s ' 2 n- h +p l ^l 2 ^/ 2 n - h2 
+ p 1 ' 51 ' 2 ^^ + p 1 -^/2 n -2 2E + pn -i^ = o P (p 1 ~ 5l uj 1 ), (7.29) 

where we used condition (D') in section 14.21 to derive the following rates like those in 
Lemma [1] (proofs thus omitted): 

||E u (A0 - E u (fc)|| = Opip 1 ' 5 ^), ||E 12 (fc) - E 12 (fc)|| = Opip 1 ^/ 2 ^ 2 ^^), 
||E 22 (fc) - E 22 (fc)|| = Opip^n- 1 *), ||E le (fc) - E le (A;)|| = P {p l ^ 2 n~ 1 ^), 
||E 2e (A;) - E 2e (A;)|| = Opip 1 ^^^), ||£ 6 (fc) - S 6 (A;)|| = P {jmT 1 '). 

(7.30) 

We form Ax with the first r\ unit eigenvectors corresponding to the ry largest eigen- 
values, i.e. the eigenvalues in D x . Now, we have 

IIEi.ll = P {\\f. y {k) - £ y (A;)|| • (||£ y (A;)|| + ||£ y (A;) - £ y (A;)||)) = Opip 2 ' 2 ^) 
— °p(p 2 ~ 2Sl ) — Op(A min (Di)), hence for n large enough, 

||E L || < ^sep^Di. ^ 



5 ^ V. V 

where the second equality is from (17.28)) and (17.291) . the third is from noting that u\ = 
o(l), and the last is from ( 17.251) . Hence, we can use Lemma [2] and arguments similar to 
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the proof of Theorem [T] to conclude that 

llAx - Ax|| = OpfjlErJI/sepfDi, ( ° 2 ° ) )) = OpW- 
Similarly, depending on the order of ||S2i(A;)|| m ; n , we have 

IIElII = Op(p 2 -^ Wl ) < isep( D2 , ( ^1 ^ xp2 _ C) 
since we assumed p c_2<5l u;i = o(l) for 5i + 5 2 < c < 2<5 2 . Hence 

II A 2 - A 2 || = OpMlE1.il/sepm2, ( ^ ) )) = °p(p C ~ 25i ui) = P (co 2 ). 
This completes the proof for the simple procedure. 

For the two-step procedure, denote yl = (I p — AiA|")y t , and define E L . = L* — L*. 
Note that 

fco 

||E L ,|| = ||L*-L*|| <^{||E y .(fc)-S y .(fc)|| 2 +||S y .(fc)||.||S y .(fc)-S y .(fc)||}, (7.31) 

fc=i 

where 

X r (k) = (I p - A 1 Af)S y (A;)(I ?) - AiAf), 

E y .(A;) = (I p - A 1 Af)S y (fc)(I p - A ± A^) = A 2 H 22 {k)A T 2 + A 2 E 2e (A;)(I p - A^), 

fco fcp 
fc=l k=l 

with Ai being the estimator from the simple procedure, so that ||Ai — Ai|| = Op(ui) 
from previous result. We write 

L* = AaQaD^Q^A^, so that L*A 2 Q 2 = A 2 Q 2 D;, 

and like section |2~3"| we take A 2 Q 2 as the A 2 to be used in our inference. 

The idea of the proof is to find the rates of ||El* || and the eigenvalues in and use 
the arguments similar to the proof for the simple procedure to get the rate for || A 2 — A 2 1| . 

First, with the assumption that ||S 22 (A;)|| x p 1 ' 52 x ||E 22 (A;)|| mhl and ||S 2e (A;)|| = 
o(j9 1-<52 ), all the eigenvalues in D 2 have order p 2 ~ 2S2 . 
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We need to find ||El*||. It is easy to show that 

HAxAf A 2 || = HA^Ax - A 1 ) T A 2 || < ||A X - A : | 
||(1, - AiAf)Ai|| = || (Ai - Ai) - Ai(Ai - A 1 ) T A 1 \\ 

Writing H] - T * * T 



= Op(ui), 
OpM. 

r9 



(7.32) 



I p — AiAf , we can decompose S y *(/c) — £ y * (k) = Yli=i 4 where 



h = HiAiSnC^AfH!, J 2 = H^E^Af H l9 
J 3 = H 1 A 1 Ei 6 (A;)Hi, I 4 = H 1 A 2 E 21 (A;)AfH 1 , 

h = HiA 2 S 22 (A;)A 2 r H 1 - A 2 S 22 (A;)A 2 r , J 6 = H 1 A 2 E 2e (A;)Hi - A 2 S 2e (A;)H 1 , 
J r = H 1 S 6 i(A;)A? , Hi, h = Hi£ 62 (fc)Al'Hi, J 9 = H 1 S e (A;)H 1 . 

Using (17.301) . (17.321) and the assumptions in model (j4.8j) . we can see that 



II A 1 


= P ( 


HHiAif. ||S U (A:)||) = P (p 1 -V), 






1141 


= o P { 


\\H 1 A 1 \\ ■ ||S 12 (fc)||) = Opip 1 -^ 2 - 6 ^) = 


1411, 




1141 


= P ( 


||/ 3 ||) = Op(||H 1 A 1 ||.(||S le (A;)-S le (fc)|| + 


\\V l£ (k)\\)) = P (p 1 




1141 


= o P { 


||S 2e (A;) - E 2e (A;)|| + ||S 2e (A;)|| • ||Ai - Ai||) 


= Op(p l - 52/2 (n- 1 ^ H 




ll^l 


= P { 


\\^2(k)\\) = Op(p 1 ^ /2 n- 1 ^), ||/ 9 ||=Op(|| 


%\\) = Op(pn- l <), 




1141 


= o P ( 


HAxAf A 2 || • ||E 22 (fc)ll + ||S 22 (A;) - S 22 (A;)||) 


= Optf-^fa + n 





Hence, we have 



|£ y *(A;) - S y *(fc)|| = Opip^ool + p 1 ' 5 ^ 2 -^ 2 ^ +p 1 ' 5l uj l 



--Opip 1 ' 5 ^). (7.33) 



We also have 



||£ y *(A;)|| = Op(||S 22 (A;)|| + ||E 26 (fc)||) = P (p^] 
Hence, with ([733]) and fTHMl) . (1731]) becomes 

||E L .|| =0 P (p 2 ~ 5 ^u 1 ). 



(7.34) 



(7.35) 



With the order of eigenvalues in D 2 being p 2 262 and noting (I7.33p . we can use Lemma 
[2] and the arguments similar to those in the proof of Theorem [T] to get 



I A* - A, 



Op(||E L *||/sep(D*,0)) = Opip 5 *- 5 ^), 
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and the proof of the theorem completes. □ 

Proof of Theorem 0, We have % = A T y t = A T Ax t + (A - A) T e t + A T e t . With 
A = (Ai A 2 ) and x t = (x^ x^ t ) T , we have 

x w = A^A lXli + A^A 2 x 2i + (Ai - Ai) T e t + Aje t , 
x 2t = A^A^u + AlA 2 x 2t + (A 2 - A 2 ) T e t + A T 2 e t . 

We first note that for i = 1,2, ||(A; - A^eJ = P (Aje t ) = O p (1) since ||A 4 - A 4 || = 
op(l) and Afej are Op(l) random variables. Then 

||xit|| = ||xi t || F > || Af AiX lt ||jr - ||Af A 2 X2t||f + P (1) 

> ||Af Aill^ • ||x lt || F - ||Af A 2 || • ||x 2t || + Op(l) 

^ ||Ai|| m i n ■ || Ax|| m in ||xu|| - Op(||x 2 J) + Op(l) 

x P ||x u || X p 2 ; 

where ||M||p denotes the Frobenius norm of the matrix M, and we used the inequality 
||AB||p > ||A|| min • ||B||p. Finally, with similar arguments, 

||x 2t || < ||x 2t || + ||A|A 2 || • ||x w || +O p (1) 
= Op(p^) + P (||A 2 - A 2 || • p^) 

l-6 2 l-6 1 

= P (p~) + o P (p~) 

l-5 1 

= Op{p 2 ), 

which establishes the claim of the theorem. □ 

Proof of Theorem US. We can easily use the decomposition in (17. 6p again for model 
(14. 8 p to arrive at 

||S y - S y || = Opip^rT 11 +p l - 62 n~ l2 + p 1_5l/2 n^ le + p 1 '^^ 1 ^ +pn~ h ), 
= Opip 1 ' 6 ^), 

where we used assumption (D'), and arguments like those in Lemma [T] to arrive at ||S X — 
S x || = P {p l ~ 5l n- h +p 1 ~ 52 n- h ), ||S X;£ - E x>e || = P {p l ' 5l/2 n- h - + p L - Sa / 2 rT h ') an d 
||S e - E e || = P (pn- 1 '). 
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Now consider ||£ y — £ y ||. The proof for ||£ y — S y || follows exactly the same lines by 
replacing A with A and is thus omitted. It can be decomposed like that in (17. 7p . Hence 
we need to consider ||£ e — £ e || = maxi<j< p |a| — <rj| < h + h + I3, where 

/ x =Op(p 1 - <5l s^ 1 ||A- A|| 2 ), 

which used decomposition in (17. 8p . and HXH^ = Op(p 1-<5l nr 1 + p l ~ 52 nr 2 ) = Op(p 1 ~ Sl n); 

I 2 = P (n- 1 * + s~ l ), 

1 /2 

where derivation is similar to that in (17.91) and thereafter. Also, ^3 = Op(/ 1 ) = 
P ((p 1 ~ 5l s~ 1 ) 1 / 2 ||A - A||). Thus, with assumption (M2)', we see that 

||£ e - £.|| = Op((p 1 ^^ 1 ) 1 / 2 ||A - A||). (7.36) 

For || £ x — £ x ||, we use the decomposition like that in the proof of Theorem [21 and 
noting that ||£ x || = 0(p l ~ &1 + p l ~ &2 ) = 0(p 1 ^ Sl ), to arrive at 

||£ x - £ x || = Op{p 1 -^ || A - A||). (7.37) 
Hence noting assumption (M2)' again and combining (17. 36p and (17.371) . we see that 

||£ y -£ y ||=Op(^lA-A||), 
which completes the proof of the theorem. □ 

Proof of Theorem [71 We omit the rate for I] y since it involves standard treatments 

like that in Theorem [3j Also the proof for || S y 1 — S" 1 !! follows exactly the arguments 
— 1 1 

below for ||£ — S y ||, and is thus omitted. 
Note that (I7.19P becomes 

- S^H = P ((1 + (y-^-Y^HA - A||) + OpiL, + L 2 ), 

with L\ and L 2 defined similar to those in the proof of Theorem [31 Similar to ( I7.20p and 
(I7.2ip . we have respectively 

L x = P (p-^\\A - A||), L 2 = P ((1 + (p^s^Y^WA - A||), 

which shows that ||S y 1 - S y 1 1| = P ((1 + (p l ~ &1 s" 1 ) 1 ! 2 ) || A - A||). This completes the 
proof of the theorem. □ 
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